

Mathematics  Research  Center 
University  of  Wisconsin-Madison 
610  Walnut  Street 
Madison,  Wisconsin  53706 

January  1983 


(Received  September  8,  1982) 


Approved for public release
Distribution unlimited




U.  S.  Army  Research  Office 
P.  O.  Box  12211 
Research  Triangle  Park 
North  Carolina  27709 


National  Cancer  Institute 
9000  Rockville  Pike 
Bethesda,  MD  20205 


Educational  Testing  Service 
Carter  Road 
Princeton,  NJ  08541 




UNIVERSITY  OF  WISCONSIN  -  MADISON 
MATHEMATICS  RESEARCH  CENTER 

BALANCED  SUBCLASSIFICATION  IN  OBSERVATIONAL  STUDIES 
USING  THE  PROPENSITY  SCORE:  A  CASE  STUDY 

Paul  R.  Rosenbaum*  and  Donald  B.  Rubin** 

Technical  Summary  Report  #2468 


January  1983 
ABSTRACT 


The propensity score is the conditional probability of assignment to a
particular treatment given a vector of observed covariates. Previous
theoretical arguments have shown that subclassification on the scalar
propensity score will balance all observed covariates. The procedure is
illustrated in a large observational study of treatments for coronary artery
disease. Five subclasses are constructed that balance 74 covariates.
Balanced subclassification is combined with model-based adjustments to provide
estimates of treatment effects within subpopulations. Two appendices address
theoretical issues: (A) propensity scores from incomplete data, and (B) the
effectiveness of subclassification on the propensity score.


AMS  (MOS)  Subject  Classifications:  62F99;  62H99;  62P10;  62H17 
Key Words: Observational studies; bias reduction; stratification; logistic
models; log linear models; direct adjustment; balancing scores.
Work  Unit  Number  4  -  Statistics  and  Probability 


* Departments of Statistics and Human Oncology, University of Wisconsin-Madison.

** Departments of Statistics and Education, University of Chicago.

Sponsored in part by the United States Army under Contract No. DAAG29-80-C-0041,
Grant P30-CA-14520 from the U.S. National Cancer Institute to the
Wisconsin Clinical Cancer Center, and by the Wisconsin Alumni Research
Foundation, the Educational Testing Service, and the U.S. Health Resources
Administration.


SIGNIFICANCE  AND  EXPLANATION 


In observational studies, treatments are assigned to experimental units
without the benefit of randomization. As a result, the units receiving the
various treatments may not be comparable with respect to observable or
unobservable characteristics. A detailed example that builds on previous
theoretical arguments shows that it is possible to form a few groups or
subclasses of units such that, within each subclass, units receiving different
treatments are comparable with respect to the distribution of observable
characteristics.



The responsibility for the wording and views expressed in this descriptive
summary lies with MRC, and not with the authors of this report.




BALANCED  SUBCLASSIFICATION  IN  OBSERVATIONAL  STUDIES 
USING  THE  PROPENSITY  SCORE:  A  CASE  STUDY 

Paul  R.  Rosenbaum*  and  Donald  B.  Rubin** 

1. Introduction: Subclassification on the Propensity Score: a Case Study
1.1. Adjustment by Subclassification in Observational Studies

In observational studies for causal effects, treatments are assigned to experimental
units without the benefits of randomization. As a result, treatment groups may differ
systematically with respect to relevant characteristics, and therefore not be directly
comparable. One commonly used method of controlling for systematic differences involves
grouping units into subclasses based on observed characteristics, and then directly
comparing only treated and control units who fall in the same subclass.

Cochran (1968) presents an example in which the mortality rates of cigarette smokers,
cigar/pipe smokers and nonsmokers are compared after subclassification on the covariate
age. The age-adjusted estimates of the average mortality for each type of smoking were
found by direct adjustment, that is, by combining the subclass-specific mortality rates
using weights equal to the proportions of the population within the subclasses. Cochran
(1968) shows that five subclasses are often sufficient to remove over 90% of the bias due
to the subclassifying variable or covariate. However, as noted by Cochran (1965), as the
number of covariates increases, the number of subclasses grows exponentially, so that even
with only two categories per covariate, yielding $2^p$ subclasses with p covariates, some
subclasses will contain no units, and many subclasses will contain either treated or
control units but not both, making it impossible to form directly adjusted estimates for
the entire population.
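The exponential growth described above can be made concrete with a little arithmetic. The sketch below counts the cells of a full cross-classification of p two-category covariates and compares that count with the number of units. The values 74 and 1515 are the numbers of covariates and patients in the case study reported later; treating every covariate as dichotomized, and the function name `n_cells`, are assumptions made only for illustration.

```python
def n_cells(p: int, levels: int = 2) -> int:
    """Number of subclasses when p covariates with `levels` categories each are fully crossed."""
    return levels ** p

n_units = 1515  # number of patients in the case study

for p in (5, 10, 20, 74):
    cells = n_cells(p)
    # At most n_units cells can be occupied, so at least this many must be empty.
    min_empty = max(cells - n_units, 0)
    print(f"p = {p:2d}: {cells} cells, at least {min_empty} empty")
```

Already at p = 20 the number of cells exceeds the number of units by several orders of magnitude, so most cells must be empty.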




Fortunately, however, there exists a scalar function of the covariates, namely the
propensity score, that summarizes the information required to balance the distribution of
the covariates. Specifically, subclasses formed from the scalar propensity score will
balance all p covariates. In fact, often five subclasses constructed from the propensity
score will suffice to remove over 90% of the bias due to each of the covariates.

1.2.  The  Propensity  Score  in  Observational  Studies 

In a study comparing two treatments, labeled 1 and 0, the propensity score,
e(x), is the conditional probability that a unit with vector x of observed covariates
will be assigned to treatment 1, $e(x) = \mathrm{pr}(z = 1 \mid x)$, where z = 1 or 0 indicates the
treatment assignment. Rosenbaum and Rubin (1983a, Theorem 1) show that subclassification
on the population propensity score will balance x, in the sense that within subclasses
that are homogeneous in e(x), the distribution of the observed covariates x is the same
for treated and control units. Formally, x and z are conditionally independent given
e(x), or in Dawid's (1979) notation:

$$x \perp\!\!\!\perp z \mid e(x). \tag{1}$$

The  propensity  score  will  balance  x  whether  or  not  x  includes  all  of  the  covariates 
used  to  assign  treatments. 

Under  the  assumption  of  strongly  ignorable  treatment  assignment,  defined  by  Rosenbaum 
and  Rubin  (1983a),  appropriate  adjustment  for  the  propensity  score  alone  will  produce 
unbiased estimates of treatment effects. One way in which treatment assignment would be
strongly ignorable is if, at each value of the observed covariates x, treatments are
assigned  randomly  with  positive  probability  to  each  treatment.  A  method  for  assessing  the 
sensitivity  of  conclusions  to  violations  of  this  assumption  of  strong  ignorability  has  been 
described  by  Rosenbaum  and  Rubin  (1983b)  in  a  particular  case;  methods  of  testing  strong 
ignorability  have  been  reviewed  by  Rosenbaum  (1982). 

Subclassification on the propensity score is not the same as any of the several
methods proposed by Miettinen (1976): the propensity score is not generally a "confounder"
score. See Rosenbaum and Rubin (1983a, §3.3) for discussion.




1.3. A Case Study of Balanced Subclassification

In this paper, we illustrate balanced subclassification on the propensity score in an
observational study of two treatments for coronary artery disease: coronary artery bypass
surgery (z = 1) and medical therapy (z = 0). The vector of covariates x contains 74
hemodynamic, angiographic, laboratory and exercise test results. The data analysis that
follows is intended for purposes of illustration, and does not constitute a study of
coronary bypass surgery.

The propensity score was estimated using a logit model (Cox 1970) for z,

$$\log\left[\frac{e(x)}{1 - e(x)}\right] = \alpha + \beta^{T} f(x),$$

where $\alpha$ and $\beta$ are parameters, and $f(\cdot)$ is a specified function.
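As an illustration of how such a logit model might be fitted, the following sketch estimates $\alpha$ and $\beta$ by Newton-Raphson using only NumPy and returns fitted propensity scores. It is a minimal sketch on synthetic data, not the authors' software: the function names `fit_logit` and `propensity`, the two-column f(x), the seed, and the true parameter values are all assumptions made for the example.

```python
import numpy as np

def fit_logit(F, z, n_iter=25, ridge=1e-8):
    """Fit log[e/(1-e)] = alpha + beta'f(x) by Newton-Raphson; returns (alpha, beta)."""
    X = np.column_stack([np.ones(len(z)), F])      # prepend an intercept column
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        e = 1.0 / (1.0 + np.exp(-X @ theta))       # current fitted probabilities
        grad = X.T @ (z - e)                       # score vector of the log likelihood
        # Observed information; the tiny ridge only stabilizes the linear solve.
        hess = (X * (e * (1.0 - e))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        theta = theta + np.linalg.solve(hess, grad)
    return theta[0], theta[1:]

def propensity(F, alpha, beta):
    """Estimated propensity score e(x) for each row of F = f(x)."""
    return 1.0 / (1.0 + np.exp(-(alpha + F @ beta)))

# Synthetic illustration (hypothetical data, not the coronary artery study).
rng = np.random.default_rng(0)
F = rng.normal(size=(2000, 2))
true_e = 1.0 / (1.0 + np.exp(-(-0.5 + F @ np.array([1.0, -0.5]))))
z = (rng.random(2000) < true_e).astype(float)

alpha_hat, beta_hat = fit_logit(F, z)
e_hat = propensity(F, alpha_hat, beta_hat)
print(alpha_hat, beta_hat)
```

In practice f(x) would hold the selected covariates and cross-products, and a standard statistical package would normally be used in place of this hand-written fit.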

2. Fitting the Propensity Score: Assessing the Balance Within Subclasses
2.1. The First Fit and Subclassification

Not all of the 74 covariates and their interactions were included in the logit model
for the 1515 patients in the study. Variables were selected for inclusion in the logit
model using a stepwise procedure. A second stepwise selection added cross-products or
interactions of those variables that were selected by the first stepwise procedure.

Based on Cochran's (1968) results and a new result in Appendix B of this paper, we may
expect approximately a 90% reduction in bias for each of the 74 variables if we subclassify
at the quintiles of the distribution of the population propensity score. Consequently, we
subclassified at the quintiles of the distribution of the estimated propensity score based
on this initial analysis, which we term the first model.
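Subclassification at the quintiles of an estimated propensity score can be sketched in a few lines. The example below assumes a vector of fitted scores is already available; the stand-in scores are random draws, and the function name `subclassify` and the tie-breaking rule are illustrative choices rather than anything specified in the paper.

```python
import numpy as np

def subclassify(e_hat, n_subclasses=5):
    """Assign subclass labels 1..n_subclasses by cutting at quantiles of e_hat."""
    e_hat = np.asarray(e_hat, dtype=float)
    probs = np.arange(1, n_subclasses) / n_subclasses   # 0.2, 0.4, 0.6, 0.8 for quintiles
    cuts = np.quantile(e_hat, probs)
    # side="left": a score exactly equal to a cut point goes to the lower subclass.
    return np.searchsorted(cuts, e_hat, side="left") + 1

rng = np.random.default_rng(1)
e_hat = rng.beta(2, 3, size=1515)      # stand-in scores; 1515 matches the study size
s = subclassify(e_hat)
counts = np.bincount(s, minlength=6)[1:]
print(counts)
```

With 1515 distinct scores, each of the five subclasses receives 1515 / 5 = 303 units.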

We now examine the balance achieved by this subclassification. Each of the 74
covariates was subjected to a two-way (2 × 5 = treatment × subclass) analysis of
variance. Column 1 of Table 1 displays the F-ratios, that is, the squares of the usual
two-sample t-statistics for comparing the medical and surgical group means for each
covariate prior to subclassification. Columns 2 and 3 display the F-ratios for the main
effect of the treatment and the treatment-by-subclass interaction in the two-way analysis
of variance. Although there has been a substantial reduction in most F-ratios as compared
with column 1, several of the F-ratios are still quite large, possibly indicating that the
propensity score is poorly estimated by the current model. Indeed, as a consequence of
Theorem 1 of Rosenbaum and Rubin (1983a), each such F-test is an approximate test of the
adequacy of the model for the propensity score; the test is only approximate primarily
because the subclasses are not exactly homogeneous in the fitted propensity score.

[Table 1. F-ratios for each of the 74 covariates: column 1 before subclassification;
columns 2 and 3 for the first model, columns 4 through 7 for intermediate models, and
columns 8 and 9 for the final model (treatment main effect and treatment-by-subclass
interaction). The table entries are illegible in this copy.]

* Figures 3, 4 and 5 refer to variables (N01.N02), LVCDAC and LMCJSIG, respectively. See §2.2.
† indicates a variable that was missing for at least 30% of patients. See §2.4 and Appendix A.
- Not included in early analyses.
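The balance check described in §2.1 can be sketched as a comparison of nested least-squares fits for a single covariate. The paper does not state which sums of squares were used for the unbalanced 2 × 5 layout, so the sketch assumes one common choice: the treatment main effect is tested after adjusting for subclass, and the interaction after adjusting for both. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def _rss(X, y):
    """Residual sum of squares and column rank from a least-squares fit of y on X."""
    beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid), int(rank)

def balance_f_ratios(x, z, s):
    """Return (F before subclassification, F main effect, F interaction) for covariate x.

    z is the 0/1 treatment indicator and s holds the subclass labels.
    """
    x, z = np.asarray(x, float), np.asarray(z, float)
    n = len(x)
    one = np.ones((n, 1))
    S = (np.asarray(s)[:, None] == np.unique(s)[None, :]).astype(float)  # subclass dummies
    ZS = S * z[:, None]                                  # treatment-by-subclass columns
    rss0, _ = _rss(one, x)                               # grand mean only
    rss_z, r_z = _rss(np.hstack([one, z[:, None]]), x)   # treatment only
    rss_s, r_s = _rss(S, x)                              # subclass only
    rss_add, r_add = _rss(np.hstack([S, z[:, None]]), x) # subclass + treatment
    rss_full, r_full = _rss(np.hstack([S, ZS]), x)       # all 2 x 5 cell means
    f_before = (rss0 - rss_z) / (rss_z / (n - r_z))      # squared two-sample t
    mse = rss_full / (n - r_full)
    f_main = ((rss_s - rss_add) / (r_add - r_s)) / mse
    f_inter = ((rss_add - rss_full) / (r_full - r_add)) / mse
    return f_before, f_main, f_inter

# Synthetic covariate that drives treatment assignment (hypothetical data).
rng = np.random.default_rng(2)
x = rng.normal(size=2000)
e = 1.0 / (1.0 + np.exp(-x))
z = (rng.random(2000) < e).astype(float)
s = np.searchsorted(np.quantile(e, [0.2, 0.4, 0.6, 0.8]), e) + 1
f_before, f_main, f_inter = balance_f_ratios(x, z, s)
print(f_before, f_main, f_inter)
```

Running the function over every covariate, before and after each refit of the propensity model, would yield columns analogous to those of Table 1.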

2.2. Refinement of the Fitted Propensity Score: Balance Obtained in the Final
Subclassification

Columns  4  through  7  describe  subclasses  based  on  a  sequence  of  models.  At  each  step, 
variables  with  large  F-ratios  that  had  previously  been  excluded  from  the  model  were 
added.  For  variables  with  large  F-ratios  that  had  previously  been  included  in  the  model, 
cross-product  terms  were  added.  The  F-ratios  for  the  final  model  appear  in  columns  8  and 
9,  and  are  plotted  in  Figures  1  and  2.  There  is  considerably  greater  balance  within  these 
final  subclasses  than  would  have  been  expected  from  randomized  assignment  to  treatment 
within  subclasses. 

Figures 3, 4 and 5 display the balance within subclasses for three important
covariates. Although the procedure used to form the subclasses may not be accessible to
some nonstatisticians, the comparability of patients within subclasses can be examined with
the simplest methods, such as the bar charts used here. For example, Figure 4 indicates
some residual imbalance on the percent of patients with poor LV contraction, at least for
patients in subclass 1, that is, in the subclass with the lowest estimated probabilities of
surgery. This imbalance is less than would be expected from randomization within
subclasses; see LVCDAC in Table 1. Nonetheless, we would possibly want to adjust for this
residual imbalance, perhaps using methods described in §3.3.

2.3.  The  Fitted  Propensity  Score;  Overlap  of  Treated  and  Control  Groups 

Figure 6 contains boxplots (Tukey, 1977) of the final fitted propensity scores. By
construction, most surgical patients have higher propensity scores, that is, higher
estimated probabilities of surgery, than most medical patients. There are a few surgical
patients with higher estimated probabilities of surgery than any medical patient,
indicating a combination of covariate values not appearing in the medical group. For
almost every medical patient, however, there is a surgical patient who is comparable in the
sense of having a similar estimated probability of surgery.

[Figure 1. Balance before and after subclassification: main effects. Axis label: F-ratio
before subclassification.]

[Figure 2. Balance before and after subclassification: interaction. Axis label: F-ratio
before subclassification.]

[Figure 3. Balance within subclasses: mean number of diseased vessels, medical and
surgical patients.]

[Figure 4. Balance within subclasses: LV contraction, medical and surgical patients.]

[Figure 5. Balance within subclasses: left main stenosis, medical and surgical patients.]

[Figure 6. Boxplots of the estimated propensity score (estimated probability of surgery,
.0 to 1.0) for medical patients and surgical patients.]
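The overlap described in §2.3 can also be checked numerically rather than read from boxplots. The sketch below counts surgical patients whose fitted score exceeds every medical patient's score, and medical patients whose score falls below every surgical patient's score. The function name `overlap_summary` and the eight scores are hypothetical.

```python
import numpy as np

def overlap_summary(e_hat, z):
    """Count units whose fitted score lies outside the range of the other group."""
    e_hat, z = np.asarray(e_hat, float), np.asarray(z)
    e_surg, e_med = e_hat[z == 1], e_hat[z == 0]
    return {
        "surgical_above_all_medical": int(np.sum(e_surg > e_med.max())),
        "medical_below_all_surgical": int(np.sum(e_med < e_surg.min())),
    }

# Hypothetical fitted scores: four medical (z = 0) and four surgical (z = 1) patients.
e_hat = np.array([0.10, 0.20, 0.35, 0.50, 0.30, 0.55, 0.80, 0.95])
z = np.array([0, 0, 0, 0, 1, 1, 1, 1])
summary = overlap_summary(e_hat, z)
print(summary)
```

Units counted here have no comparable unit in the other treatment group, so estimates for them rest on extrapolation.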

2.4.  Incomplete  Covariate  Information 

The variables in Table 1 that are identified by a dagger (†) were not measured during
the early years of the study, so that many patients are missing these covariate values. If
the propensity score is defined as the conditional probability of assignment to treatment 1
given the observed covariate information and the pattern of missing data, then Appendix A
shows that subclassification on the propensity score will balance both the observed data
and the pattern of missing data. Essentially, we estimated the probabilities of surgical
treatment separately for early and late patients, and then used these estimated
probabilities as propensity scores. Subclassification on the corresponding population
propensity scores can be expected to balance, within subclasses, each of the following:
(a) the distribution of those covariates that are measured for both early and late
patients, (b) the proportions of early and late patients, (c) the distribution of all
covariates for the late patients. (For proof, see Corollary 1.1 of Appendix A.) Table 1
shows that the observed values of all covariates were indeed balanced by our procedure.

3.  Results:  Estimates  of  the  Average  Treatment  Effect 

3.1.  Functional  Improvement  as  the  Response  Variable:  Placebo  Effects 

In this section, medicine and surgery are compared with respect to a particular
response, namely functional improvement. Functional capacity is measured by the crude,
four category (I = best, II, III, IV = worst) New York Heart Association classification,
which measures a patient's ability to perform common tasks. The current study is confined
to patients in classes II, III, or IV at the time of cardiac catheterization, i.e.,
patients who could improve. A patient is defined to have substantial improvement at 6
months after cardiac catheterization if he:

1. is alive, and

2. has not had a myocardial infarction, and

3. is in class I, or has improved by two classes
(i.e., IV to II);

otherwise, the patient is not substantially improved at six months.
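The three-part definition above translates directly into a small predicate. In the sketch below the functional classes are coded 1 (best) through 4 (worst); the function name, argument names, and coding are assumptions made for illustration.

```python
def substantially_improved(alive: bool, had_mi: bool,
                           initial_class: int, class_at_6mo: int) -> bool:
    """Apply the three-part definition of substantial improvement at six months.

    Classes are coded 1 (best) to 4 (worst); initial_class is 2, 3 or 4 in this study.
    class_at_6mo is ignored for patients who are not alive.
    """
    if not alive or had_mi:
        return False
    # Condition 3: in class I, or improved by two classes (in effect IV to II).
    return class_at_6mo == 1 or (initial_class - class_at_6mo) >= 2

print(substantially_improved(True, False, 4, 2))   # improved by two classes
print(substantially_improved(True, False, 3, 2))   # improved by only one class
```

Because the study is confined to initial classes II through IV, the only two-class improvement that does not end in class I is IV to II, which is why the definition mentions that case explicitly.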

It  should  be  noted  that  there  is  substantial  evidence  that  patients  suffering  from 
coronary  artery  disease  respond  to  placebos;  for  a  review  of  this  evidence,  see  Benson  and 
McCallie  (1979).  Part  or  all  of  the  treatment  effect  may  reflect  differences  in  the 
placebo  effects  of  the  two  treatments. 

3.2. Subclass Specific Results; Direct Adjustment

The proportions improved in each subclass for medicine and surgery are displayed in
Figure 7. In each subclass, the proportion improved under surgery exceeds the proportion
improved under medical therapy.

Each  subclass  contains  a  total  of  303  patients.  Therefore,  for  medical  therapy  and 
surgery,  the  directly  adjusted  proportions  improved,  with  subclass  total  weights,  are 
simply  the  averages  of  the  five  subclass  specific  proportions.  These  adjusted  proportions 
improved  are  .36  for  medicine  and  .67  for  surgery.  Standard  errors  for  the  adjusted 
proportions,  calculated  following  Mosteller  and  Tukey  (1977,  Chapter  11c),  are  .04  and  .05, 
respectively.  By  Corollary  4.2  of  Rosenbaum  and  Rubin  (1983a),  if  treatment  assignment  is 
strongly  ignorable,  the  difference  between  the  surgical  and  medical  adjusted  proportions 
would  be  asymptotically  unbiased  for  the  average  treatment  effect  if  subclasses  were 
exactly  homogeneous  in  the  estimated  propensity  score.  The  results  of  Cochran  (1968)  and 
Appendix  B  suggest  that  adjustment  with  the  five  subclasses  used  here  will  remove  about  90% 
of the initial bias, providing treatment assignment is strongly ignorable. Similar
considerations apply to the subpopulation specific results described in §3.3.
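Direct adjustment with subclass total weights, as used in §3.2, is a weighted average of subclass-specific proportions. The sketch below shows that with five subclasses of 303 patients each, the adjusted proportion reduces to the simple average. The subclass-specific proportions are hypothetical placeholders (the paper reports them only graphically, in Figure 7), and the function name is illustrative.

```python
def direct_adjustment(proportions, subclass_totals):
    """Weighted average of subclass-specific proportions, weights = subclass totals."""
    total = sum(subclass_totals)
    return sum(p * n / total for p, n in zip(proportions, subclass_totals))

subclass_totals = [303] * 5                 # five equal subclasses, 5 x 303 = 1515

# Hypothetical subclass-specific proportions improved (not the paper's values).
p_example = [0.2, 0.3, 0.4, 0.5, 0.6]

adjusted = direct_adjustment(p_example, subclass_totals)
simple_mean = sum(p_example) / len(p_example)
print(adjusted, simple_mean)
```

With unequal subclass totals the same function applies; only the weights change.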

3.3. Adjustment and Estimation Within Subpopulations Defined by x

This section obtains adjusted estimates of the probabilities of substantial
improvement at six months for subpopulations of patients defined by the number of diseased
vessels (N) and the New York Heart Association functional class at the time of cardiac
catheterization (F). To avoid an excessive number of subpopulations, the small but
clinically important subset of patients with significant left main stenosis has been
excluded.

In Table 2, patients are cross-classified according to the number of diseased
vessels (N), initial functional class (F), treatment (Z), subclass (S), and
condition at six months (I: improved = substantial improvement as defined in §3.1). A
loglinear model which fixed the IZN, IZF, ISN, SZ, SF, FN margins provides a good fit to
this table (likelihood ratio chi square 122.5 on 120 degrees of freedom). (Here, IZN
denotes the marginal table formed by summing the entries in the table over initial
functional class F and subclass S, leaving a three way table.)

The directly adjusted estimates in Table 3 were calculated from the fitted counts
using the NFS marginal table for weights. In other words, within each subpopulation
defined by the number of diseased vessels (N) and the initial functional class (F),
estimates of the probabilities of improvement are adjusted using subclass (S) total
weights. For example, from Table 2, the weight applied to both medical and surgical
estimated probabilities at N = ONE, F = II, subclass 1 is proportional to
9 + 8 + 1 + 0 = 18.
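The weight calculation in this example extends to all five subclasses of a subpopulation. The sketch below sums the Table 2 counts for N = ONE, F = II over treatment and outcome within each subclass. The counts are as read from Table 2 of this copy, so any misreading of that table carries over; only the subclass 1 total of 18 is confirmed by the text. The variable names are illustrative.

```python
# Counts for N = ONE, F = II as read from Table 2, indexed by subclass S = 1..5.
counts = {
    ("MEDICAL", "improved"):      [9, 6, 2, 1, 0],
    ("MEDICAL", "not improved"):  [8, 4, 4, 3, 0],
    ("SURGICAL", "improved"):     [1, 3, 1, 0, 1],
    ("SURGICAL", "not improved"): [0, 2, 1, 1, 0],
}

# Subclass totals: sum over treatment and outcome within each subclass.
subclass_totals = [sum(column) for column in zip(*counts.values())]

# Normalized weights applied to both the medical and the surgical estimates.
weights = [t / sum(subclass_totals) for t in subclass_totals]
print(subclass_totals, weights)
```

The first total reproduces the 9 + 8 + 1 + 0 = 18 of the text; the same weights multiply the medical and the surgical fitted probabilities, which is what makes the two adjusted estimates comparable.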

The key observations from Table 3 are the following:

1. In all six subpopulations, the estimated probabilities of substantial improvement
at six months are higher following surgery than following medical treatment (between 30%
and 387% higher). The estimated probabilities differ least for one vessel
disease, functional class IV, and differ most for three vessel disease, functional class
III. As noted above, these differences may reflect differences in placebo effects of the
two treatments (Benson and McCallie, 1979).

2.  The  definition  of  substantial  improvement  has  resulted  in  lower  estimated 
probabilities  of  improvement  for  class  III  patients  than  for  class  II  and  class  IV 
patients. 

3.  The  estimated  probabilities  of  improvement  under  surgery  vary  less  than  the 
estimated  probabilities  of  improvement  under  medicine. 
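The range quoted in observation 1 can be checked against the Table 3 entries for the two subpopulations it names. The sketch below computes how much higher the surgical estimate is, in percent, for one vessel disease in class IV and three vessel disease in class III; the function name is illustrative.

```python
def percent_higher(p_surgery: float, p_medicine: float) -> float:
    """Percent by which the surgical estimate exceeds the medical estimate."""
    return 100.0 * (p_surgery / p_medicine - 1.0)

# Table 3 entries for the two subpopulations named in observation 1.
least = percent_higher(0.635, 0.487)   # one vessel disease, functional class IV
most = percent_higher(0.649, 0.133)    # three vessel disease, functional class III
print(round(least, 1), round(most, 1))
```

Both results agree with the quoted 30% and 387% to within about one percentage point, a discrepancy that is plausibly due to the three-decimal rounding of the Table 3 entries.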


Table 2

Counts Within Subpopulations

Number of    Initial                   Condition at 6 Months (I)
Diseased     Functional   Treatment   Improved, by Subclass (S)    Not Improved, by Subclass (S)
Vessels (N)  Class (F)    (Z)             1    2    3    4    5        1    2    3    4    5

ONE          II           MEDICAL         9    6    2    1    0        8    4    4    3    0
                          SURGICAL        1    3    1    0    1        0    2    1    1    0
             III          MEDICAL         4    1    1    1    0        9    9    4    2    1
                          SURGICAL        0    2    1    5    3        0    0    2    1    1
             IV           MEDICAL        27   15   10    5    2       20   15   13    6    3
                          SURGICAL        1   10    4    8   12        2    3    6    6    5

TWO          II           MEDICAL         3    9    6    1    0        4   10    9   10    1
                          SURGICAL        3    0    6    4    2        0    0    4    0    2
             III          MEDICAL         2    2    4    8    1        4    9   12    4    2
                          SURGICAL        0    1    4    9   11        0    3    1    5    5
             IV           MEDICAL         7   12   16    6    6        ?   15   15   11    7
                          SURGICAL        2    5    8   25   27        1    1    2    7   13

THREE        II           MEDICAL         5    9    2    2    0       25    9    8    4    2
                          SURGICAL        3    2    4    7    8        0    0    1    1    4
             III          MEDICAL         5    3    4    1    1       24   23   14   13    5
                          SURGICAL        2    5    7   17   14        5    4    3    5    6
             IV           MEDICAL        14   21   15    8    3       62   39   40   29   11
                          SURGICAL        1   13   20   22   36        4    7    7   15   18

? = entry illegible in this copy.


Table 3

Directly Adjusted Estimated Probabilities of Substantial
Improvement

M = medical therapy
S = surgery

Number of                       Initial Functional Class
Diseased Vessels   Treatment      II      III     IV

One                M             .469    .277    .487
                   S             .705    .629    .635

Two                M             .404    .221    .413
                   S             .780    .706    .7?4

Three              M             .248    .133    .278
                   S             .709    .649    .657

? = digit illegible in this copy.


4. Sensitivity of Estimates to the Assumption of Strongly Ignorable Treatment Assignment

The estimates in Section 3 are approximately unbiased under the assumption of strongly
ignorable treatment assignment. Rosenbaum and Rubin (1983b) develop and apply to the
current example a method for assessing the sensitivity of these estimates to a particular
violation of strong ignorability. They assume that treatment assignment is not strongly
ignorable given the observed covariates x, but is strongly ignorable given (x, u),
where u is an unobserved binary covariate. The estimate of the average treatment effect
is recomputed under various assumptions about u. A related Bayesian approach is developed
by Rubin (1978).

5. Conclusions: The Propensity Score and Multivariate Subclassification

With  just  five  subclasses  formed  from  an  estimated  scalar  propensity  score,  we  have 
substantially  reduced  the  bias  in  74  covariates  simultaneously.  Although  the  process  of 
estimating  the  propensity  score  for  use  in  balanced  subclassification  does  require  some 
care,  the  comparability  of  treated  and  control  patients  within  each  of  the  final  subclasses 
can  be  verified  using  the  simplest  statistical  methods,  and  therefore  results  based  on 
balanced  subclassification  can  be  persuasive  even  to  audiences  with  limited  statistical 
training.  The  same  subclasses  can  also  be  used  to  estimate  treatment  effects  within 
subpopulations defined by the covariates x. Moreover, balanced subclassification may be
combined  with  model  based  adjustments  to  provide  improved  estimates  of  treatment  effects 
within  subpopulations. 




Appendix  A:  Balancing  Properties  of  the  Propensity  Score  with  Incomplete  Data 

In §2.4, we noted that several covariates were missing for a large number of
patients. Let x* be a p-coordinate vector, where the jth coordinate of x* is a
covariate value if the jth covariate was observed, and is an asterisk (*) if the jth
covariate is missing. (Formally, x* is an element of $\{R, *\}^p$.) Then
$e^* = \mathrm{pr}(z = 1 \mid x^*)$ is a generalized propensity score. The following theorem
and corollary show that e* has balancing properties that are similar to the balancing
properties of the propensity score e.

Theorem 1.

$$x^* \perp\!\!\!\perp z \mid e^*$$

Proof: Identical to the proof of Theorem 1 of Rosenbaum and Rubin (1983a), with x* in
place of x and e* in place of e.

Theorem 1 implies that subclassification on the generalized propensity score e*
balances the observed covariate information and the pattern of missing covariates. Note
that Theorem 1 does not generally imply that subclassification on e* balances the
unobserved coordinates of x; that is, Theorem 1 does not generally imply

$$x \perp\!\!\!\perp z \mid e^*.$$

The consequences of Theorem 1 are clearest when there are only two patterns of missing
data, with $x = (x_1, x_2)$, where $x_1$ is always observed and $x_2$ is sometimes missing.
Let c = 1 when $x_2$ is observed, and let c = 0 when $x_2$ is missing. Then
$e^* = \mathrm{pr}(z = 1 \mid x_1, x_2, c = 1)$ for units with $x_2$ observed, and
$e^* = \mathrm{pr}(z = 1 \mid x_1, c = 0)$ for units with $x_2$ missing. Subclasses of units
may be formed using e*, ignoring the pattern of missing data.

Corollary 1.1.

A. For units with $x_2$ missing, there is balance on $x_1$ at each value
of e*, that is,

$$x_1 \perp\!\!\!\perp z \mid e^*, \; c = 0.$$

B. For units with $x_2$ observed, there is balance on $(x_1, x_2)$ at each
value of e*, that is,

$$(x_1, x_2) \perp\!\!\!\perp z \mid e^*, \; c = 1.$$

C. There is balance on $x_1$ at each value of e*, that is,

$$x_1 \perp\!\!\!\perp z \mid e^*.$$

D. The frequency of missing data is balanced at each value of e*, that is,

$$c \perp\!\!\!\perp z \mid e^*.$$

Proof: Parts A and B follow immediately from Theorem 1 of Rosenbaum and Rubin (1983a), and
Parts C and D follow immediately from Theorem 1 above.

In  practice,  we  may  estimate  e*  in  several  ways.  In  a  large  study  with  only  a  few 
patterns  of  missing  data,  we  may  use  a  separate  logit  model  for  each  pattern  of  missing 
data. In general, however, there are $2^p$ potential patterns of missing data with p
covariates. If the covariates are discrete, then we may estimate e* by treating the *
as  an  additional  category  for  each  of  the  p  covariates. 
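The idea of treating the asterisk as an additional category can be illustrated on a toy data set. The sketch below recodes missing values of discrete covariates as the category "*" and then estimates e* by the treated fraction within each observed cell of x*. This saturated, cell-frequency estimate is a simplification chosen to keep the example short; the paper's suggestion of a model with * as an extra category would ordinarily be fitted as a logit model. All names and data are hypothetical.

```python
MISSING = "*"

def with_missing_category(rows):
    """Recode None as the extra category '*' in each discrete covariate."""
    return [tuple(MISSING if v is None else v for v in row) for row in rows]

def generalized_propensity(rows, z):
    """Estimate e* = pr(z = 1 | x*) by the treated fraction in each observed cell of x*."""
    cells = with_missing_category(rows)
    totals, treated = {}, {}
    for cell, zi in zip(cells, z):
        totals[cell] = totals.get(cell, 0) + 1
        treated[cell] = treated.get(cell, 0) + zi
    return [treated[c] / totals[c] for c in cells]

# Hypothetical data: covariate 1 always observed, covariate 2 sometimes missing (None).
rows = [("a", 1), ("a", None), ("a", None), ("b", 1), ("b", 1), ("a", 1)]
z = [1, 0, 1, 0, 1, 1]
e_star = generalized_propensity(rows, z)
print(e_star)
```

Units with the same observed values but different missingness patterns fall in different cells, so the pattern of missing data itself enters the estimated score.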




Appendix B: The Effectiveness of Subclassification on the Propensity Score in Removing Bias

Cochran (1968) studied the effectiveness of univariate subclassification in removing
bias in observational studies. In this appendix, we show how Cochran's results are related
to subclassification on the propensity score.

Let $f = f(x)$ be any scalar valued function of $x$. The initial bias in $f$ is
$B_I = E(f \mid z = 1) - E(f \mid z = 0)$. The bias in $f$ after subclassification on the propensity score and
direct adjustment with subclass total weights is

$$B_S = \sum_{j=1}^{J} \{E(f \mid z = 1, e \in I_j) - E(f \mid z = 0, e \in I_j)\} \, \mathrm{pr}(e \in I_j),$$

where there are $J$ subclasses, and $I_j$ is the set of values of $e$ that define the $j$th
subclass. The percent reduction in bias in $f$ due to subclassification on the propensity
score is $100(1 - B_S / B_I)$.
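The definitions above can be made concrete with sample analogues. The following sketch computes the initial bias, the bias after subclassification with subclass total weights, and the percent reduction for a small made-up data set. The data, the single cut point at 0.5, and the function names are assumptions for illustration, and sample means stand in for the expectations.

```python
def mean(values):
    return sum(values) / len(values)


def initial_bias(f, z):
    """Sample analogue of B_I: treated mean of f minus control mean of f."""
    treated = [fi for fi, zi in zip(f, z) if zi == 1]
    control = [fi for fi, zi in zip(f, z) if zi == 0]
    return mean(treated) - mean(control)


def subclassified_bias(f, z, e, cuts):
    """Sample analogue of B_S with subclass total weights.

    Subclass j holds the units whose score e falls in [edge_j, edge_{j+1}).
    Every non-empty subclass must contain both treated and control units.
    """
    n = len(f)
    edges = [float("-inf")] + list(cuts) + [float("inf")]
    total = 0.0
    for lo, hi in zip(edges, edges[1:]):
        idx = [i for i in range(n) if lo <= e[i] < hi]
        if not idx:
            continue
        within = initial_bias([f[i] for i in idx], [z[i] for i in idx])
        total += (len(idx) / n) * within  # weight = share of all units
    return total


# Made-up data: ten units, two distinct propensity score values.
e = [0.2, 0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8, 0.8]
z = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
f = [2, 1, 1, 2, 2, 3, 3, 4, 4, 3]

b_initial = initial_bias(f, z)
b_sub = subclassified_bias(f, z, e, cuts=[0.5])
percent_reduction = 100 * (1 - b_sub / b_initial)
```

In this toy case the treated and control means differ by 1.4 before subclassification and by 0.5 within each of the two subclasses, so roughly 64% of the initial bias is removed.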

Cochran's (1968) results do not directly apply to subclassification on the propensity
score, since his work is concerned with the percent reduction in bias in $f$ after
subclassification on $f$, rather than the percent reduction in bias in $f$ after
subclassification on $e$. Nonetheless, as the following theorem shows, Cochran's results
are applicable providing (a) the conditional expectation of $f$ given $e$, that is
$E(f \mid e) = \bar{f}$, say, is a monotone function of $e$, and (b) $\bar{f}$ has one of the distributions
studied by Cochran. In particular, under these conditions, subclassification at the
quintiles of the distribution of the propensity score, $e$, will produce approximately a
90% reduction in the bias of $f$. Note that in the following theorem, Cochran's (1968)
results apply directly to the problem of determining the percent reduction in bias in
$\bar{f}$ after subclassification on $\bar{f}$.
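The figure of approximately 90% can be checked numerically in one simple case. The sketch below assumes, purely for illustration, that f = x is a single covariate distributed N(0, 1) among controls and N(delta, 1) among treated units, with equal group sizes. Under these assumptions the propensity score is a strictly increasing (logistic) function of x, so subclasses at quantiles of e coincide with subclasses at quantiles of x. The value delta = 0.25 and all function names are hypothetical choices, not taken from the report.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def truncated_mean(mu, a, b):
    """Mean of a N(mu, 1) variable restricted to the interval (a, b)."""
    lo, hi = a - mu, b - mu
    mass = STD.cdf(hi) - STD.cdf(lo)
    return mu + (STD.pdf(lo) - STD.pdf(hi)) / mass


def mixture_cdf(x, delta):
    """CDF of x in the pooled population (half controls, half treated)."""
    return 0.5 * STD.cdf(x) + 0.5 * STD.cdf(x - delta)


def mixture_quantile(p, delta):
    """Quantile of the pooled distribution, found by bisection."""
    lo, hi = -10.0, 10.0 + abs(delta)
    for _ in range(100):
        mid = (lo + hi) / 2
        if mixture_cdf(mid, delta) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def percent_reduction(delta, n_sub=5):
    """Percent reduction in the bias of x from n_sub equal-probability subclasses."""
    inner = [mixture_quantile(j / n_sub, delta) for j in range(1, n_sub)]
    cuts = [float("-inf")] + inner + [float("inf")]
    b_initial = delta  # E(x | z = 1) - E(x | z = 0)
    b_sub = 0.0
    for a, b in zip(cuts, cuts[1:]):
        weight = mixture_cdf(b, delta) - mixture_cdf(a, delta)  # pr(e in I_j)
        b_sub += weight * (truncated_mean(delta, a, b) - truncated_mean(0.0, a, b))
    return 100 * (1 - b_sub / b_initial)


five_subclasses = percent_reduction(0.25, n_sub=5)
two_subclasses = percent_reduction(0.25, n_sub=2)
```

With five subclasses the computed reduction should come out close to the 90% quoted above, and with only two subclasses it should be markedly smaller, which is the qualitative pattern Cochran's normal-theory results describe.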

Theorem 2: The percent reduction in bias, $100(1 - B_S / B_I)$, in $f$ following
subclassification at specified quantiles of the distribution of the propensity score, $e$,
equals the percent reduction in bias in $\bar{f} = E(f \mid e)$ after subclassification at the same quantiles
of the distribution of $\bar{f}$, providing $\bar{f}$ is a strictly monotone function of $e$.




Proof: Write $\bar{f} = E(f \mid e)$. First, we show that within a subclass defined by $e \in S$, the bias in $f$ equals
the bias in $\bar{f}$; that is, we show that

$$\text{(B.1)} \qquad E(f \mid e \in S, z = 1) - E(f \mid e \in S, z = 0) = E(\bar{f} \mid e \in S, z = 1) - E(\bar{f} \mid e \in S, z = 0).$$

To show this it is sufficient to observe that for $t = 0, 1$,

$$E(f \mid e \in S, z = t) = E\{E(f \mid e, e \in S, z = t) \mid e \in S, z = t\}$$

$$= E\{E(f \mid e) \mid e \in S, z = t\}$$

$$= E(\bar{f} \mid e \in S, z = t),$$

where the second equality follows from the fact that $e$ is the propensity score (i.e.,
from equation (1)).

From (B.1) with $S = (-\infty, \infty)$, it follows that the initial bias in $f$ equals the
initial bias in $\bar{f}$. To complete the proof, we need to show that the bias in $f$ after
subclassification on $e$ equals the bias in $\bar{f}$ after subclassification on $\bar{f}$. Since by
assumption $\bar{f}$ is a strictly monotone function of $e$, subclasses defined at specified
quantiles of the distribution of $e$ contain exactly the same units as subclasses defined
at the same quantiles of the distribution of $\bar{f}$. It follows from this observation and
(B.1) that the bias in $f$ within each subclass defined by $e$ equals the bias in $\bar{f}$
within each subclass defined by $\bar{f}$. Since (a) the initial biases in $f$ and $\bar{f}$ are equal,
(b) the subclasses formed from $e$ contain the same units as the subclasses formed from
$\bar{f}$, and (c) within each subclass, the bias in $f$ equals the bias in $\bar{f}$, it follows that
the percent reduction in bias in $f$ after subclassification on $e$ equals the percent
reduction in bias in $\bar{f}$ after subclassification on $\bar{f}$. //
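Identity (B.1) can be verified exactly on a small finite population. The sketch below uses a hypothetical population in which each covariate value x carries a probability pr(x), a propensity score e(x) and a value f(x); exact rational arithmetic is used so that the two sides of (B.1) can be compared for equality rather than up to rounding. The population and all names are assumptions made for this check.

```python
from fractions import Fraction as Fr

# Hypothetical finite population: (pr(x), e(x), f(x)) for each covariate value x.
POPULATION = [
    (Fr(1, 6), Fr(1, 5), Fr(0)),
    (Fr(1, 6), Fr(1, 5), Fr(4)),
    (Fr(1, 6), Fr(2, 5), Fr(1)),
    (Fr(1, 6), Fr(2, 5), Fr(7)),
    (Fr(1, 6), Fr(4, 5), Fr(3)),
    (Fr(1, 6), Fr(4, 5), Fr(9)),
]


def f_bar(e_value):
    """E(f | e = e_value), averaging f over units that share the same score."""
    cell = [(p, f) for p, e, f in POPULATION if e == e_value]
    return sum(p * f for p, f in cell) / sum(p for p, _ in cell)


def bias(g, in_subclass):
    """E(g | e in S, z = 1) - E(g | e in S, z = 0) for a function g(e, f)."""
    means = []
    for t in (1, 0):
        num = den = Fr(0)
        for p, e, f in POPULATION:
            if in_subclass(e):
                w = p * (e if t == 1 else 1 - e)  # pr(x) * pr(z = t | x)
                num += w * g(e, f)
                den += w
        means.append(num / den)
    return means[0] - means[1]


def raw_f(e, f):
    return f


def smoothed_f(e, f):
    return f_bar(e)


def whole_population(e):
    return True


def lower_subclass(e):
    return e < Fr(1, 2)  # S contains the two smallest score values


bias_f_all = bias(raw_f, whole_population)
bias_fbar_all = bias(smoothed_f, whole_population)
bias_f_low = bias(raw_f, lower_subclass)
bias_fbar_low = bias(smoothed_f, lower_subclass)
```

Here the conditional expectation of f given e takes the values 2, 4 and 6 at e = 1/5, 2/5 and 4/5, so it is strictly monotone in e as the theorem requires, and the bias in f matches the bias in that conditional expectation both overall and within the subclass e < 1/2.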


Acknowledgement 

The  authors  acknowledge  valuable  conversations  with  A.  P.  Dempster  on  the  issues 
discussed  in  this  paper. 




REFERENCES 


Benson, H. and McCallie, D. (1979). Angina pectoris and the placebo effect. New England
Journal of Medicine, 300, 1424-1428.

Cochran, W. G. (1965). The planning of observational studies of human populations.
Journal of the Royal Statistical Society, Series A, 128, 234-255.

Cochran, W. G. (1968). The effectiveness of adjustment by subclassification in removing
bias in observational studies. Biometrics, 24, 205-213.

Cox, D. R. (1970). The Analysis of Binary Data. London: Methuen.

Dawid, A. P. (1979). Conditional independence in statistical theory (with discussion).
Journal of the Royal Statistical Society, Series B, 41, 1-31.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete
data using the EM algorithm (with discussion). Journal of the Royal Statistical
Society, Series B, 39, 1-38.

Miettinen, O. (1976). Stratification by a multivariate confounder score. American Journal
of Epidemiology, 104, 609-620.

Mosteller, C. F. and Tukey, J. W. (1977). Data Analysis and Regression. Reading, MA:
Addison-Wesley.

Rosenbaum, P. R. (1982). Testing the assumption of strongly ignorable treatment assignment
in observational studies: A review within a general framework. Submitted to the
Journal of the American Statistical Association.

Rosenbaum, P. R. and Rubin, D. B. (1983a). The central role of the propensity score in
observational studies for causal effects. To appear in Biometrika, 70, #1.

Rosenbaum, P. R. and Rubin, D. B. (1983b). Assessing sensitivity to an unobserved binary
covariate in an observational study with binary outcome. To appear in the Journal
of the Royal Statistical Society, Series B, 45, #2.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581-592.

Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization.
Annals of Statistics, 6, 34-58.

Tukey, J. W. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.



REPORT DOCUMENTATION PAGE

1. Report Number: #2468

4. Title (and Subtitle): Balanced Subclassification in Observational Studies Using the
Propensity Score: A Case Study

5. Type of Report & Period Covered: Summary Report - no specific reporting period

7. Author(s): Paul R. Rosenbaum and Donald B. Rubin

9. Performing Organization Name and Address: Mathematics Research Center, University of
Wisconsin, 610 Walnut Street, Madison, Wisconsin 53706

10. Program Element, Project, Task Area & Work Unit Numbers: Work Unit Number 4 -
Statistics & Probability

11. Controlling Office Name and Address: See Item 18 below

12. Report Date: January 1983

13. Number of Pages: 23

15. Security Class. (of this report): UNCLASSIFIED

16. Distribution Statement (of this Report): Approved for public release; distribution
unlimited.

18. Supplementary Notes: U. S. Army Research Office, P. O. Box 12211, Research Triangle
Park, North Carolina 27709; National Cancer Institute, 9000 Rockville Pike, Bethesda,
MD 20205; Educational Testing Service, Carter Road, Princeton, NJ 08541

19. Key Words: Observational studies; bias reduction; stratification; logistic models; log
linear models; direct adjustment; balancing score.




